Website Content Analysis Tool

Purpose

This Python application analyzes and compares the structure and content of multiple websites to identify best practices. By examining key elements such as the number of images and videos on each page, which pages are most prevalent per website category, and other customizable parameters, the tool provides insights that can be used to improve website design, SEO strategy, and user experience.

Main Parameters We Need to Take Into Account

The scraper should analyze each domain and identify similarities across the websites, focusing on the following areas:

1. Sub-pages and Common Pages

  • How many sub-pages are there on average across these sites?
  • What pages are commonly found on these sites (e.g., landing page, About Us page, contact page, etc.)?

2. Common Supported Content

  • What type of content is frequently found across these domains?
    • Team Member sections: Are these located on the landing page or the "About Us" page?
    • Videos: Where are they commonly placed?
  • Elements to look for:
    • Team Member sections
    • FAQs
    • Testimonials (e.g., "Love this product - John Doe, CEO")
    • Clients (e.g., "These brands trust us," "Some of our clients")
    • Maps (e.g., Google Maps)
    • Food menus
    • Pricing tables (e.g., pricing tiers)
    • Client logos or names
  • What similarities exist across these sites?
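As a rough illustration of how the elements above could be detected, here is a minimal keyword-based sketch. The keyword lists and section names are assumptions for illustration, not part of the tool:

```python
from bs4 import BeautifulSoup

# Hypothetical keyword lists; real detection would need richer heuristics.
SECTION_KEYWORDS = {
    "team": ["our team", "team member", "meet the team"],
    "faq": ["faq", "frequently asked"],
    "testimonials": ["testimonial", "what our clients say"],
    "clients": ["our clients", "trusted by", "brands trust us"],
    "pricing": ["pricing", "plans"],
}

def detect_sections(html: str) -> dict:
    """Return which common content sections appear in the page text."""
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    return {name: any(kw in text for kw in keywords)
            for name, keywords in SECTION_KEYWORDS.items()}
```

Counting how often each section appears per category then gives the cross-site similarity picture described above.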

3. Text Content

  • How much text content is generally found on these common pages?
    • Text content thresholds:
      • 1,500 words or more = "paragraph heavy"
      • Above 200 words and under 1,500 words = "informative"
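The thresholds above can be sketched as a small classifier; the function name and the label for pages below both thresholds are assumptions:

```python
def classify_text_volume(word_count: int) -> str:
    """Label a page by word count, per the thresholds above."""
    if word_count >= 1500:
        return "paragraph heavy"
    if word_count > 200:
        return "informative"
    return "light"  # below both stated thresholds; label is an assumption
```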

4. Common Unsupported Content

  • What unsupported content is frequently found on these sites?
    • Examples:
      • "Apply for job" or career elements
      • E-commerce functionality
      • Booking functionality
      • Calendar functionality

5. Contact Methods

  • What are the common ways visitors can contact these businesses?
    • Examples:
      • Contact forms
      • Phone numbers
      • Email links
      • WhatsApp links
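A minimal sketch of detecting these contact methods from a page's anchors and forms; the WhatsApp URL patterns are assumptions about how such links typically look:

```python
from bs4 import BeautifulSoup

def detect_contact_methods(html: str) -> dict:
    """Check a page for the common contact methods listed above."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a.get("href") or "" for a in soup.find_all("a")]
    return {
        "email": any(h.startswith("mailto:") for h in hrefs),
        "phone": any(h.startswith("tel:") for h in hrefs),
        "whatsapp": any("wa.me" in h or "whatsapp" in h for h in hrefs),
        "contact_form": soup.find("form") is not None,
    }
```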

6. Videos

  • Is it common for these sites to include videos? If so, how many videos do they typically have? Check all sub-pages as well.
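One way to count videos per page is sketched below; treating YouTube/Vimeo iframes as videos alongside native `<video>` tags is an assumption:

```python
from bs4 import BeautifulSoup

def count_videos(html: str) -> int:
    """Count native <video> tags plus common video-embed iframes."""
    soup = BeautifulSoup(html, "html.parser")
    native = len(soup.find_all("video"))
    embeds = sum(1 for f in soup.find_all("iframe")
                 if any(host in (f.get("src") or "")
                        for host in ("youtube", "vimeo")))
    return native + embeds
```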

7. Images

  • On average, how many images do the sites in this scope have?

8. Content Presentation on Common Pages

  • How do these sites usually present content on common pages like the "About Us" page (if this is a common element)?
    • Example structure for the "About" section of marketing websites:
      • Company Overview: A brief introduction, including the company's history, mission, and vision. Often, it highlights the company's values and what differentiates it from competitors.
      • Mission and Vision Statements: A clear description of the company's purpose (mission) and long-term goals (vision), helping align with potential clients' or customers' values.
      • Company History: A narrative tracing the company's origins, major milestones, and growth over time, possibly including founding stories and pivotal moments.
      • Team or Leadership Profiles: Information on key team members, their roles, backgrounds, and expertise, which humanizes the brand and builds trust by showcasing experience and credentials.
      • Core Values or Philosophy: A description of the principles guiding the company's operations, often tied to the brand's mission and resonating with the audience's values.
      • Testimonials or Case Studies: Social proof in the form of testimonials or case studies that demonstrate the company's success and impact, building credibility.
      • Contact Information or Links: A section or link to the company's contact page or direct contact details, making it easy for visitors to reach out.
    • This structure ensures that the "About" section is comprehensive, informative, and effective in creating a connection with potential clients or partners.

9. Question Generation from Objective 8

  • Create questions based on the analysis of objective 8 (content presentation). These questions will later help identify content similarities across the domains.

10. Saving Results in JSON

  • The results of all the above objectives must be saved in JSON format.
  • Use the OpenAI API to achieve objectives 8 and 9.
  • The overall goal is to find similarities across the domains in scope and analyze their structure and content.

Features

  • Image Counting: Automatically counts the number of images (<img> tags) on each page of a website.
  • "About" Page Detection: Checks if the website has an "About" page by looking for internal links with "about" in their URLs.
  • Customizable Analysis: You can easily extend the application to check for other parameters, such as meta tags, videos, headers, etc.
  • Automated Reporting: Generates a JSON report that summarizes the analysis, making it easy to compare multiple websites.

How It Works

This application uses requests to fetch each site's HTML and BeautifulSoup4, a powerful Python library for web scraping, to parse it. By identifying specific HTML elements (e.g., <img> tags or <a> links), it collects data that can be used to compare various websites' practices.

For each website, the application performs the following tasks:

  1. Fetches and parses the website's HTML content.
  2. Counts the number of images on the page.
  3. Checks if there is a link to an "About" page.
  4. Compiles the data into a structured report in an appropriate category JSON file.
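The four steps above can be sketched as follows; the function names and JSON field names are assumptions, and fetching is separated from parsing so the analysis can be tested on static HTML:

```python
import json
import requests
from bs4 import BeautifulSoup

def analyze_html(url: str, html: str) -> dict:
    """Steps 2-3: count images and look for an 'About' link."""
    soup = BeautifulSoup(html, "html.parser")
    images = len(soup.find_all("img"))
    has_about = any("about" in (a.get("href") or "").lower()
                    for a in soup.find_all("a"))
    return {"url": url, "images": images, "has_about_page": has_about}

def analyze_website(url: str) -> dict:
    """Step 1: fetch the page, then analyze it."""
    return analyze_html(url, requests.get(url, timeout=10).text)

def save_report(results: list, path: str) -> None:
    """Step 4: compile the collected data into a JSON file."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
```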

Requirements

To run this application, you need to have the following Python packages installed:

  • requests: Used to make HTTP requests to fetch website content.
  • beautifulsoup4: A popular library for parsing HTML and extracting data from websites.
  • pandas: A data manipulation library used to aggregate and shape the analysis results.

You can install these dependencies by running the following command:

pip install requests beautifulsoup4 pandas

Usage

  1. Clone the repository to your local machine.

  2. Install the required dependencies using the command above.

  3. Add the websites you want to analyze to the websites list in the analyze_website function.

  4. Run the script to start the server, then trigger the analysis with a POST request; the results are written to a JSON file:

curl -X POST http://localhost:5000/process_websites -H "Content-Type: application/json" -d "{\"websites\": [\"web1\", \"web2\"]}"
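A minimal sketch of the endpoint that the curl command above targets; Flask is an assumption about how the server is wired up, and the per-site analysis is stubbed:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/process_websites", methods=["POST"])
def process_websites():
    """Accept a JSON list of websites and return one result per site."""
    websites = request.get_json().get("websites", [])
    # In the real tool, each site would be fetched and analyzed here.
    results = [{"url": url, "status": "queued"} for url in websites]
    return jsonify(results)

# Start the server with: app.run(port=5000)
```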

Technologies Used

  • Chart.js to display the data in charts
  • BeautifulSoup4 for scraping the data

Website Categories We Need Data From

Categories

Part 1

  1. Manufacturing Industry
  2. Law & Order
  3. Events
  4. Health
  5. Consultancy
  6. IT-Consultancy
  7. Beauty
  8. Construction

Part 2

  1. Art
  2. Marketing
  3. Education
  4. Food Stores (Catering)
  5. Real Estate Brokers
  6. Home Decor (Interior Designer)
  7. Traveling
  8. Logistics

Part 3

  1. Religion
  2. Restaurants
  3. Healthcare
  4. Animal Care (Veterinarian)
  5. Sports Club
  6. Music Artists
  7. Architects
  8. Web Development

Part 4

  1. Graphic Design
  2. Electronics
  3. Fashion (Fashion Designer)
  4. Dentists
  5. Pharmacy
  6. Farming
  7. Insurance
  8. Hair Salon

Part 5

  1. Culture Museums
  2. Finance
  3. Energy
  4. Auto Mechanic
  5. Politics
  6. Gardening Stores
  7. Single Property Vacation Rental